Introduction to Statistics

Bennett Kleinberg

Week 1

  • Why do we even need statistics?
  • About the course
  • Basic ideas
  • Frequency distributions

Getting started

Another one

About Maria

Maria is 26 years old, single, outspoken, and very bright. She majored in law. As a student, she was deeply concerned with issues of discrimination and miscarriage of justice and participated in weekly animal-rights demonstrations.

Adapted from Tversky & Kahneman (1983)

Which is more probable?

  • A: Maria works in a law firm
  • B: Maria works in a law firm and does pro bono work for animal-rights activists
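One way to check the comparison is the conjunction rule of probability. As a sketch, write \(L\) for "Maria works in a law firm" and \(R\) for "Maria does pro bono work for animal-rights activists" (both symbols are introduced here for illustration only). Option A is \(P(L)\) and option B is \(P(L \cap R)\):

```latex
\[
P(L \cap R) = P(L)\,P(R \mid L) \le P(L)
\qquad \text{because } P(R \mid L) \le 1 .
\]
```

So option B can never be more probable than option A, however well the description seems to fit.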

Hollywood ruins books (does it?)

Good books become bad movies!

(demo)

Berkson’s paradox

Also holds for attractiveness and niceness in dating

Book tip: Jordan Ellenberg “How not to be wrong”

YT video from Numberphile
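A minimal simulation sketch of how Berkson's paradox can arise. It assumes, purely for illustration, that book quality and movie quality are independent standard-normal scores and that we only notice a book/movie pair when their combined quality exceeds a threshold; the names, the threshold of 1.5, and the selection rule are assumptions, not part of the lecture demo.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = 20_000

# Assumption: book and movie quality are unrelated in the full population.
book_quality = [rng.gauss(0, 1) for _ in range(n)]
movie_quality = [rng.gauss(0, 1) for _ in range(n)]

# Assumption: we only notice pairs whose combined quality is high enough.
noticed = [(b, m) for b, m in zip(book_quality, movie_quality) if b + m > 1.5]
noticed_books, noticed_movies = zip(*noticed)

r_all = pearson_r(book_quality, movie_quality)
r_noticed = pearson_r(noticed_books, noticed_movies)

print(f"all pairs:     r = {r_all:+.2f}")
print(f"noticed pairs: r = {r_noticed:+.2f}")
```

Under these assumptions the correlation is close to zero for all pairs but clearly negative among the noticed pairs: the selection step alone produces the "good books become bad movies" pattern.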

Why should I care?

  • we are flooded with data
  • we want to make sense of the world around us
  • … esp. about human behaviour and society

Statistics is the best way to do this.

Suppose you wanted to know…

  • whether loneliness increased during lockdown?
  • how much more dangerous COVID-19 is for people with cancer?
  • how engagement in online communities relates to extremist world views?
  • whether a curfew increases rioting?

Statistics is not a good way to approach these questions.

It is the ONLY way to meaningfully approach these questions!

What does it even mean?

“Statistics, the science of collecting, analyzing, presenting, and interpreting data.” (Britannica)

“A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.” (Merriam-Webster)

Note: this is \(\neq\) “statistics as a collection of data”

Synopsis of statistics

  • we work with data in a numerical sense
  • we want to obtain information from these data
  • and we want to understand the uncertainty that comes with data
    • this is one aspect where it differs from mathematical modelling

And lastly: the word data is the plural form of datum.

I still don’t care!

  • not being able to interpret data can be dramatic
  • statistics is the discipline that is about data

\(\frac{deaths}{cases}\) vs \(\frac{deaths}{population}\)

You will learn why T. was wrong!
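A small sketch of why the two ratios above must not be confused. The numbers are hypothetical and chosen only to make the arithmetic easy to follow; they do not describe any real country or time period.

```python
# Hypothetical numbers, for illustration only.
deaths = 500
cases = 10_000
population = 1_000_000

# Share of confirmed cases that ended in death.
deaths_per_case = deaths / cases
# Share of the whole population that died.
deaths_per_capita = deaths / population

print(f"deaths / cases:      {deaths_per_case:.2%}")
print(f"deaths / population: {deaths_per_capita:.2%}")
```

The same number of deaths gives very different figures depending on the denominator, so any comparison has to use the same ratio on both sides.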

The data never lie!?

  • people will use statistics to make their points
  • this can be used to mislead
  • you must become statistics-savvy to call bullsh*t

Nope: still not interested!

  • social + behavioural sciences have embraced quantitative methods
  • we seek to express processes/attributes/disorders as numbers
  • so we also need methods to make sense of these numbers

The special role for Psychology

Measurement challenge

  • Human behaviour and social processes are very complex
  • Compare this to a drop of oil or the properties of gold
  • We are often interested in the unobservables:
    • intelligence
    • well-being
    • emotions (fear, sadness, …)
    • loneliness
  • These are very hard to measure!
  • And we need methods to learn about humans in general (= the population)

This is the essence of inferential statistics.

Two stances towards statistics

  • Statistics as a tool
    • you use it to serve your purpose (e.g. making an inference based on your data)
    • you have a pragmatic relationship with statistics (e.g. it’s needed to do research and to understand the world)
  • Statistics as a discipline
    • about improving statistics
    • about better ways to model data, make inferences, quantify uncertainty
    • esp. now: making sense of massive volumes of data (never use the term Big Data)

The connection to AI

Video example on YT

My promise

  • Basic statistics today is what reading was yesterday
  • If you invest the time to fully understand the content in this module (always ask if things are unclear), you will be fine
  • Every more advanced approach builds on these basic ideas

If you are super-pragmatic: being able to do statistics pays well in the industry.

The course: structure

  1. Lectures (14x)
  2. Seminars (4x)
  3. SPSS practicals (3x)

Lectures

  • weekly video content
  • weekly (live) in-depth session
  • incl. Q&A

Seminars

  • led by teaching assistants
  • scheduled in B3W4, B3W8, B4W3, B4W6
  • walk-through of exercises

SPSS seminars

  • led by teaching assistants
  • coordinator: Ghislaine van Bommel
  • about implementing tests in SPSS
  • first exposure to statistical software

Our expectation

Component                               Amount  Duration  Total hours
Lectures                                14      2h        28h
Seminars                                4       2h        8h
SPSS labs                               3       2h        6h
Weekly revision/self-study/preparation  16      6h        96h
Assessment: SPSS exam                   1       2h        2h
Assessment: main exam                   1       3h        3h
TOTAL                                   -       -         ~140h

Our expectation

  • prepare the lectures
  • watch/attend the lectures and revise them
  • make use of the seminars
  • do the homework

Materials

  • Statistics for the Behavioral Sciences (Gravetter & Wallnau)
  • SPSS survival manual (Pallant)

The course: Piazza

  • online Q&A platform
  • when in doubt: always ask!
  • we will answer questions and review your answers
  • (watch the “introduction to Piazza” session)

The course: assessment

  • Main exam
  • SPSS test

SPSS test

  • assesses your ability to perform analyses in SPSS
  • all content from the book + practicals
  • also tests the ability to interpret results
  • computerised test
  • Outcome: PASS/FAIL

Main exam

  • multiple-choice questions (e.g. correct vs incorrect; 4 options)
  • standard 1-10 grade scale
  • pass mark: 5.5 (after correction for guessing)
  • date and form to be confirmed

Basic ideas in statistics

  • The idea of “data”
  • Types of statistical thinking
  • First look at distributions

Approaches of statistics

Descriptive statistics

  • about describing the data
  • often through summary statistics (Week 2)
  • e.g. on average, a Spanish woman is 1.63m tall
  • e.g. the wealthiest 1% own 50% of the equity/shares in companies

Approaches of statistics

Inferential statistics

  • we want to make an inference from something to something else
  • here: we want to make an inference from the sample to the population

Inferential statistics

data \(\neq\) data

  • Height (in cm)
  • Annual income (in EUR)
  • Smoker vs. non-smoker
  • Pet (dog, cat, hamster, bunny)
  • Support for Trump (from -5 to +5)

Dimensions of the “data” idea

  • Constructs vs operationalisations
  • Discrete vs continuous variables
  • Different measurement levels

Constructs vs operationalisations

Discrete vs continuous variables

Some variables can only take separate, indivisible values (categories or whole numbers):

  • e.g. gender, eye color, native language
  • but also: no. of pets, no. of siblings, how often you went on holiday

There cannot be a value between 1 and 2 pets.

These variables are called discrete variables.

Discrete vs continuous variables

Other variables can take all values between two points:

  • e.g. income, height, weight, speed
  • your height can, in principle, be expressed as 1.75123461736823837423 meters
  • thus a value of a continuous variable (e.g. 1.75m) is actually an interval: measured to the nearest centimetre, 1.75m stands for every height from 1.745m to 1.755m

Measuring variables

The nominal scale

  • named categories (e.g., dog, cat, hamster)
  • no quantitative distinction between them (you cannot say a dog is more than a cat)
  • no zero!

Measuring variables

The ordinal scale

  • ranked named categories (e.g., 1st, 2nd, 3rd)
  • distances between ranks are not necessarily equal
  • no zero!

Measuring variables

The interval scale

  • consists of equally-sized intervals between values
  • each unit has the same size
  • e.g. temperature:
    • going from \(21^{\circ}C\) to \(26^{\circ}C\)
    • going from \(1^{\circ}C\) to \(6^{\circ}C\)
    • both have the same difference
  • but: no true zero! (the zero point is arbitrarily chosen)

Measuring variables

The ratio scale

  • consists of equally-sized intervals between values
  • each unit has the same size
  • but now we do have an absolute zero
  • e.g. distance: a distance of zero means your bike has not moved!
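A minimal sketch of what the missing true zero means in practice. It assumes the usual Celsius-to-Fahrenheit and kilometre-to-mile conversions; the function names are illustrative. On an interval scale, equal differences stay equal after a change of units but ratios do not; on a ratio scale, ratios survive as well.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def km_to_miles(km):
    """Convert a distance from kilometres to miles."""
    return km / 1.609344


# Interval scale: both 5-degree steps from the slide remain equal in Fahrenheit.
warm_step_f = celsius_to_fahrenheit(26) - celsius_to_fahrenheit(21)
cold_step_f = celsius_to_fahrenheit(6) - celsius_to_fahrenheit(1)
print(f"steps in Fahrenheit: {warm_step_f:.1f} and {cold_step_f:.1f}")

# ... but "20 degrees is twice as warm as 10 degrees" depends on the unit.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)
print(f"temperature ratio: {ratio_celsius:.2f} (C) vs {ratio_fahrenheit:.2f} (F)")

# Ratio scale: "20 km is twice as far as 10 km" holds in any unit.
ratio_km = 20 / 10
ratio_miles = km_to_miles(20) / km_to_miles(10)
print(f"distance ratio: {ratio_km:.2f} (km) vs {ratio_miles:.2f} (miles)")
```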

Representing data

Today:

  • data as a frequency distribution
  • ways to represent data
  • describing the location of datapoints

Example

How many pets do you have?

  • we ask 10 people
  • they state the number of pets that currently live in their household

Remember:

  • the construct is “number of pets”
  • the operationalisation is “the number of pets that currently live in a person’s main household”

Our data

id pets
1 0
2 2
3 2
4 3
5 0
6 1
7 3
8 1
9 1
10 0

We may want some more structure

  • maybe we can count how often each option occurs
  • i.e. how many people have 0, 1, 2, … pets?

These are called the frequencies of values.

Frequencies of values

pets Freq
0 3
1 3
2 2
3 2

A structured table is then called a frequency distribution table.
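As a minimal sketch, the counting step can be reproduced in Python with the ten answers from the "Our data" table. The variable names are illustrative; the course itself uses SPSS for this.

```python
from collections import Counter

# Number of pets reported by the 10 respondents (ids 1 to 10).
pets = [0, 2, 2, 3, 0, 1, 3, 1, 1, 0]

# Count how often each value occurs.
frequencies = Counter(pets)

print("pets  f")
for value in sorted(frequencies):
    print(f"{value:>4}  {frequencies[value]}")
```

Sorting the values before printing gives the same layout as the frequency distribution table above.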

Another example

  • someone’s gender
  • possible options: male - female - prefer-not-to-say

gender Freq
female 55
male 38
prefer-not-to-say 7

Freq. distributions for continuous variables

id income
31 37900
32 37300
33 17000
34 45300
35 25800
36 33600
37 89000
38 20200
39 57900
40 20700

Problem for a table?

income Freq
20700 1
21300 2
22400 1
22800 1
22900 1
23700 1
25100 1
25800 1
26700 2
27900 1

Grouped frequency distributions

Idea:

  • we bundle some value ranges together
  • we can probably afford to lose some measurement precision here
  • example:
    • low (0-25000)
    • middle (25001-50000)
    • upper-middle (50001-75000)
    • high (75001+)
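The grouping idea can be sketched as follows, using the four income bands from the list above and, as example input, only the ten incomes shown earlier (ids 31 to 40). The function and variable names are illustrative assumptions.

```python
from collections import Counter

# Upper limit (inclusive) of each band; anything above the last limit is "high".
UPPER_LIMITS = [(25_000, "low"), (50_000, "middle"), (75_000, "upper-middle")]


def income_group(income):
    """Return the income band that a single income falls into."""
    for upper_limit, label in UPPER_LIMITS:
        if income <= upper_limit:
            return label
    return "high"


# The ten incomes listed for ids 31 to 40.
incomes = [37900, 37300, 17000, 45300, 25800, 33600, 89000, 20200, 57900, 20700]

grouped = Counter(income_group(income) for income in incomes)

# Print the bands in their natural order rather than alphabetically.
for label in ["low", "middle", "upper-middle", "high"]:
    print(f"{label:<13} {grouped[label]}")
```

Each income is reduced to its band, which is exactly the loss of precision that the grouped table accepts in exchange for readability.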

Grouped income data

group Freq
low 27
middle 24
upper-middle 19
high 30

Is this ideal?

What if we have these two data collections?

  1. no. of pets (\(n=10\))
  2. no. of pets (\(n=10000\))

What do we expect to see?

Comparing the tables

X f
0 2991
1 3057
2 2997
3 472
4 483

For the small dataset

X f
0 3
1 3
2 2
3 2

Solution: proportions

X f prop
0 2991 0.2991
1 3057 0.3057
2 2997 0.2997
3 472 0.0472
4 483 0.0483

Proportion: \(p = \frac{f}{N}\)

… and percentages

X f prop perc
0 2991 0.2991 29.91
1 3057 0.3057 30.57
2 2997 0.2997 29.97
3 472 0.0472 4.72
4 483 0.0483 4.83

Percentages: \(\text{percentage} = p \times 100 = \frac{f}{N} \times 100\)
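A short sketch of both formulas applied to the two pet datasets (\(n=10\) and \(n=10000\)). The function name is an illustrative assumption; the frequencies are the ones from the tables above.

```python
def proportions(freq_table):
    """Turn a {value: frequency} table into {value: proportion}, p = f / N."""
    total = sum(freq_table.values())
    return {value: f / total for value, f in freq_table.items()}


small = {0: 3, 1: 3, 2: 2, 3: 2}                       # n = 10
large = {0: 2991, 1: 3057, 2: 2997, 3: 472, 4: 483}    # n = 10000

for name, table in [("n=10", small), ("n=10000", large)]:
    print(name)
    for value, p in proportions(table).items():
        # The percentage is simply the proportion multiplied by 100.
        print(f"  X={value}  f={table[value]:>4}  prop={p:.4f}  perc={p * 100:.2f}")
```

Because proportions and percentages are relative to \(N\), the two tables become directly comparable even though their raw frequencies differ by a factor of 1000.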

Visual representation

Histograms

Histograms (proportions)

Histograms comparison

Locating data points

  • we may want to find where a value lies relative to the whole data
  • e.g. Are 3 pets a lot or not?
  • Where does an income of \(X=40000\) lie in our data?

We can locate points based on the frequency distribution.

Percentiles

  1. We sort our frequency table
  2. We calculate a cumulative percentage (same for proportions)
  3. We locate our data point of interest (here: having 3 pets)

X f prop perc perc_cum
0 2991 0.2991 29.91 29.91
1 3057 0.3057 30.57 60.48
2 2997 0.2997 29.97 90.45
3 472 0.0472 4.72 95.17
4 483 0.0483 4.83 100.00

Interpreting percentiles

  • We know that 3 pets corresponds to a cumulative percentage of 95.17%
  • i.e. 95.17% of our data has been accumulated once we reach 3 pets (inclusive)
  • 95.17% of responses are covered by 0, 1, 2, or 3 pets.

“3 pets” has a percentile rank of 95.17%

“3 pets” is the 95th percentile
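The three steps (sort, accumulate, locate) can be sketched in Python with the frequencies from the table. The function name is an illustrative assumption.

```python
from itertools import accumulate

# Frequency table for the large pets dataset (n = 10000).
freq_table = {0: 2991, 1: 3057, 2: 2997, 3: 472, 4: 483}


def cumulative_percentages(freq_table):
    """Return {value: cumulative percentage} for a {value: frequency} table."""
    values = sorted(freq_table)                                        # step 1: sort
    total = sum(freq_table.values())
    running_totals = accumulate(freq_table[value] for value in values)  # step 2: accumulate
    return {value: 100 * cf / total for value, cf in zip(values, running_totals)}


perc_cum = cumulative_percentages(freq_table)

# Step 3: locate the data point of interest (having 3 pets).
print(f"{perc_cum[3]:.2f}")  # prints 95.17
```

Accumulating the raw frequencies first and dividing once at the end avoids adding up already-rounded percentages.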

Income data

X f perc perc_cum
800 1 1.0526 1.0526
1100 1 1.0526 2.1052
1500 1 1.0526 3.1578
4700 1 1.0526 4.2104
5700 1 1.0526 5.2630
9200 1 1.0526 6.3156
9300 1 1.0526 7.3682
10300 1 1.0526 8.4208
10400 1 1.0526 9.4734
11100 1 1.0526 10.5260

Obtaining percentiles

Where does an income of \(X=40000\) lie in our data?

X f perc perc_cum
37800 1 1.0526 46.3146
37900 1 1.0526 47.3672
38500 1 1.0526 48.4198
41900 1 1.0526 49.4724
43600 1 1.0526 50.5250

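An income of 40,000 falls between the rows for 38,500 and 41,900, so we read off the cumulative percentage of the highest income that does not exceed it (38,500): an income of 40,000 has a percentile rank of 48.42%.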
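A minimal sketch of this lookup for a value that does not appear in the table itself. It uses only the five rows shown above (an excerpt of the full income table), and the function name is an illustrative assumption.

```python
from bisect import bisect_right

# Excerpt of the sorted income table: values and their cumulative percentages.
incomes = [37800, 37900, 38500, 41900, 43600]
perc_cum = [46.3146, 47.3672, 48.4198, 49.4724, 50.5250]


def percentile_rank(x, values, cumulative):
    """Cumulative percentage of the largest table value that is <= x."""
    position = bisect_right(values, x)  # number of table values <= x
    if position == 0:
        # In a complete table, nothing lies at or below x.
        # For an excerpt like this one, the answer would be unknown instead.
        return 0.0
    return cumulative[position - 1]


rank = percentile_rank(40000, incomes, perc_cum)
print(f"{rank:.2f}")  # prints 48.42
```

`bisect_right` counts how many sorted values are less than or equal to `x`, which is exactly the "at or below" reading of a percentile rank.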

Recap

  • intro to the module
  • first steps
  • frequency distributions
  • locating data points

Next week

Understanding data further:

  • central tendency of data
  • variability of data

Contact + questions